PASTA: Pretrained Action-State Transformer Agents
Self-supervised learning has brought about a revolutionary paradigm shift in
various computing domains, including NLP, vision, and biology. Recent
approaches involve pre-training transformer models on vast amounts of unlabeled
data, serving as a starting point for efficiently solving downstream tasks. In
the realm of reinforcement learning, researchers have recently adapted these
approaches by developing models pre-trained on expert trajectories, enabling
them to address a wide range of tasks, from robotics to recommendation systems.
However, existing methods mostly rely on intricate pre-training objectives
tailored to specific downstream applications. This paper presents a
comprehensive investigation of models we refer to as Pretrained Action-State
Transformer Agents (PASTA). Our study uses a unified methodology and covers an
extensive set of general downstream tasks including behavioral cloning, offline
RL, sensor failure robustness, and dynamics change adaptation. Our goal is to
systematically compare various design choices and provide valuable insights to
practitioners for building robust models. Key highlights of our study include
tokenization at the action and state component level, using fundamental
pre-training objectives like next token prediction, training models across
diverse domains simultaneously, and using parameter-efficient fine-tuning
(PEFT). The developed models in our study contain fewer than 10 million
parameters and the application of PEFT enables fine-tuning of fewer than 10,000
parameters during downstream adaptation, allowing a broad community to use
these models and reproduce our experiments. We hope that this study will
encourage further research into the use of transformers with first-principles
design choices to represent RL trajectories and contribute to robust policy
learning.
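The component-level tokenization highlighted above can be sketched in a few lines: every scalar component of a state or action becomes its own discrete token, and training pairs come from plain next-token prediction. The bin count, value range, and tiny trajectory below are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

# Sketch of component-level tokenization with a next-token-prediction
# target: each scalar state/action component is discretized into its own
# token. Bin count and value range are illustrative assumptions.
N_BINS = 64

def tokenize_component(x, low=-1.0, high=1.0, n_bins=N_BINS):
    """Map one scalar state/action component to a token id in [0, n_bins)."""
    x = float(np.clip(x, low, high))
    return int((x - low) / (high - low) * (n_bins - 1))

def tokenize_trajectory(states, actions):
    """Flatten a trajectory into one token sequence:
    s_0[0], s_0[1], ..., a_0[0], ..., s_1[0], ..."""
    tokens = []
    for s, a in zip(states, actions):
        tokens += [tokenize_component(c) for c in s]
        tokens += [tokenize_component(c) for c in a]
    return tokens

states = [np.array([0.1, -0.5]), np.array([0.2, -0.4])]
actions = [np.array([0.9]), np.array([-0.9])]
seq = tokenize_trajectory(states, actions)
inputs, targets = seq[:-1], seq[1:]   # next-token prediction pairs
```

A transformer pre-trained on such sequences can then be adapted downstream by swapping the prediction head, which is where the PEFT budget cited above comes into play.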
Assessing Quality-Diversity Neuro-Evolution Algorithms Performance in Hard Exploration Problems
A fascinating aspect of nature lies in its ability to produce a collection of
organisms that are all high-performing in their niche. Quality-Diversity (QD)
methods are evolutionary algorithms inspired by this observation that have
obtained great results in many applications, from wing design to robot
adaptation.
Recently, several works demonstrated that these methods could be applied to
perform neuro-evolution to solve control problems in large search spaces. In
such problems, diversity can be a target in itself. Diversity can also be a way
to enhance exploration in tasks exhibiting deceptive reward signals. While the
first aspect has been studied in depth in the QD community, the latter remains
comparatively underexplored in the literature. Exploration is at the heart of
several domains concerned with solving control problems, such as Reinforcement
Learning, and QD methods are promising candidates to overcome the associated
challenges. Therefore, we
believe that standardized benchmarks exhibiting control problems in high
dimension with exploration difficulties are of interest to the QD community. In
this paper, we highlight three candidate benchmarks and explain why they appear
relevant for systematic evaluation of QD algorithms. We also provide
open-source implementations in Jax, allowing practitioners to run numerous
experiments quickly on modest compute resources.
Comment: GECCO 2022 Workshop on Quality Diversity Algorithm Benchmark
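For readers new to QD, the core mechanism can be sketched with a minimal MAP-Elites loop, a canonical QD algorithm: an archive keeps the best ("elite") solution found in each cell of a discretized behavior-descriptor space. The toy fitness, descriptor, and archive resolution below are illustrative assumptions, unrelated to the benchmarks or Jax implementations discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
N_CELLS = 10                 # cells per descriptor dimension (assumption)
archive = {}                 # cell index -> (fitness, solution)

def fitness(x):              # quality: negated sphere function (toy)
    return -float(np.sum(x ** 2))

def descriptor(x):           # behavior descriptor: first two coordinates (toy)
    return x[:2]

def cell_of(d):
    idx = np.clip(((d + 1.0) / 2.0 * N_CELLS).astype(int), 0, N_CELLS - 1)
    return tuple(int(i) for i in idx)

for _ in range(2000):
    if archive and rng.random() < 0.9:   # usually mutate a random elite
        _, parent = archive[list(archive)[rng.integers(len(archive))]]
        x = np.clip(parent + 0.1 * rng.normal(size=4), -1.0, 1.0)
    else:                                # otherwise sample a new solution
        x = rng.uniform(-1.0, 1.0, size=4)
    c = cell_of(descriptor(x))
    if c not in archive or fitness(x) > archive[c][0]:
        archive[c] = (fitness(x), x)     # replace the elite in this cell

coverage = len(archive)                  # number of niches filled
```

Coverage of the archive is one of the standard QD metrics a benchmark would track; deceptive-reward tasks stress precisely this ability to keep filling niches.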
Neuroevolution is a Competitive Alternative to Reinforcement Learning for Skill Discovery
Deep Reinforcement Learning (RL) has emerged as a powerful paradigm for
training neural policies to solve complex control tasks. However, these
policies tend to be overfit to the exact specifications of the task and
environment they were trained on, and thus do not perform well when conditions
deviate slightly or when composed hierarchically to solve even more complex
tasks. Recent work has shown that training a mixture of policies driven to
explore different regions of the state-action space, as opposed to a single
one, can address this shortcoming by generating a diverse set of behaviors,
referred to as skills, that can be collectively used to great effect in
adaptation tasks or for hierarchical planning. This is typically realized by
including a diversity term - often derived from information theory - in the
objective function optimized by RL. However, these approaches often require
careful hyperparameter tuning to be effective. In this work, we demonstrate
that less widely-used neuroevolution methods, specifically Quality Diversity
(QD), are a competitive alternative to information-theory-augmented RL for
skill discovery. Through an extensive empirical evaluation comparing eight
state-of-the-art methods on the basis of (i) metrics directly evaluating the
skills' diversity, (ii) the skills' performance on adaptation tasks, and (iii)
the skills' performance when used as primitives for hierarchical planning, QD
methods are found to provide equal, and sometimes improved, performance whilst
being less sensitive to hyperparameters and more scalable. As no single method
is found to provide near-optimal performance across all environments, there is
a rich scope for further research which we support by proposing future
directions and providing optimized open-source implementations.
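The information-theory-derived diversity term mentioned above can be made concrete with a small sketch: skill-discovery methods typically reward a skill z in state s by r = log q(z|s) - log p(z), where q(z|s) is a discriminator guessing which skill produced the state. The fixed random linear "discriminator" below is a toy assumption standing in for the classifier such methods learn jointly with the policies.

```python
import numpy as np

rng = np.random.default_rng(1)
N_SKILLS = 4
PROJ = rng.normal(size=(N_SKILLS, 3))    # toy discriminator weights (assumption)

def discriminator(state):
    """Toy q(z | s): softmax over a fixed linear projection of the state."""
    logits = PROJ @ state
    e = np.exp(logits - logits.max())
    return e / e.sum()

def diversity_reward(state, skill):
    """r = log q(z|s) - log p(z), with p(z) uniform over the skills."""
    q = discriminator(state)
    return float(np.log(q[skill] + 1e-8) - np.log(1.0 / N_SKILLS))

s = rng.normal(size=3)
rewards = [diversity_reward(s, z) for z in range(N_SKILLS)]
```

The hyperparameter sensitivity the abstract mentions partly stems from balancing this term against the task reward, a trade-off QD methods sidestep by maintaining diversity structurally in an archive.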
Guidelines for the use and interpretation of assays for monitoring autophagy (3rd edition)
In 2008 we published the first set of guidelines for standardizing research in autophagy. Since then, research on this topic has continued to accelerate, and many new scientists have entered the field. Our knowledge base and relevant new technologies have also been expanding. Accordingly, it is important to update these guidelines for monitoring autophagy in different organisms. Various reviews have described the range of assays that have been used for this purpose. Nevertheless, there continues to be confusion regarding acceptable methods to measure autophagy, especially in multicellular eukaryotes. For example, a key point that needs to be emphasized is that there is a difference between measurements that monitor the numbers or volume of autophagic elements (e.g., autophagosomes or autolysosomes) at any stage of the autophagic process versus those that measure flux through the autophagy pathway (i.e., the complete process including the amount and rate of cargo sequestered and degraded). In particular, a block in macroautophagy that results in autophagosome accumulation must be differentiated from stimuli that increase autophagic activity, defined as increased autophagy induction coupled with increased delivery to, and degradation within, lysosomes (in most higher eukaryotes and some protists such as Dictyostelium) or the vacuole (in plants and fungi). In other words, it is especially important that investigators new to the field understand that the appearance of more autophagosomes does not necessarily equate with more autophagy. In fact, in many cases, autophagosomes accumulate because of a block in trafficking to lysosomes without a concomitant change in autophagosome biogenesis, whereas an increase in autolysosomes may reflect a reduction in degradative activity. It is worth emphasizing here that lysosomal digestion is a stage of autophagy and evaluating its competence is a crucial part of the evaluation of autophagic flux, or complete autophagy.
Here, we present a set of guidelines for the selection and interpretation of methods for use by investigators who aim to examine macroautophagy and related processes, as well as for reviewers who need to provide realistic and reasonable critiques of papers that are focused on these processes. These guidelines are not meant to be a formulaic set of rules, because the appropriate assays depend in part on the question being asked and the system being used. In addition, we emphasize that no individual assay is guaranteed to be the most appropriate one in every situation, and we strongly recommend the use of multiple assays to monitor autophagy. Along these lines, because of the potential for pleiotropic effects due to blocking autophagy through genetic manipulation, it is imperative to delete or knock down more than one autophagy-related gene. In addition, some individual Atg proteins, or groups of proteins, are involved in other cellular pathways, so not all Atg proteins can be used as a specific marker for an autophagic process. In these guidelines, we consider these various methods of assessing autophagy and what information can, or cannot, be obtained from them. Finally, by discussing the merits and limits of particular autophagy assays, we hope to encourage technical innovation in the field.
Adaptive optimization problems under uncertainty with limited feedback
Thesis: Ph. D., Massachusetts Institute of Technology, Sloan School of Management, Operations Research Center, 2017. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (pages 159-166). This thesis is concerned with the design and analysis of new algorithms for sequential optimization problems with limited feedback on the outcomes of alternatives when the environment is not perfectly known in advance and may react to past decisions. Depending on the setting, we take either a worst-case approach, which protects against a fully adversarial environment, or a hindsight approach, which adapts to the level of adversariality by measuring performance in terms of a quantity known as regret. First, we study stochastic shortest path problems with a deadline imposed at the destination when the objective is to minimize a risk function of the lateness. To capture distributional ambiguity, we assume that the arc travel times are only known through confidence intervals on some statistics and we design efficient algorithms minimizing the worst-case risk function. Second, we study the minimax achievable regret in the online convex optimization framework when the loss function is piecewise linear. We show that the curvature of the decision maker's decision set has a major impact on the growth rate of the minimax regret with respect to the time horizon. Specifically, the rate is always square root when the set is a polyhedron, while it can be logarithmic when the set is strongly curved. Third, we study the Bandits with Knapsacks framework, a recent extension to the standard Multi-Armed Bandit framework capturing resource consumption.
We extend the methodology developed for the original problem and design algorithms with regret bounds that are logarithmic in the initial endowments of resources in several important cases that cover many practical applications, such as bid optimization in online advertising auctions. Fourth, we study more specifically the problem of repeated bidding in online advertising auctions when some side information (e.g. browser cookies) is available ahead of submitting a bid. Optimizing the bids is modeled as a contextual Bandits with Knapsacks problem with a continuum of arms. We design efficient algorithms with regret bounds that scale as the square root of the initial budget. By Arthur Flajolet. Ph. D.
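The online convex optimization setting with piecewise-linear losses studied in the second part can be sketched with a plain online subgradient descent loop on the Euclidean unit ball. The data, horizon, and step sizes below are illustrative assumptions; this baseline attains the square-root regret rate, which is the benchmark against which the curvature-dependent (logarithmic) rates are compared.

```python
import numpy as np

# Online subgradient descent on piecewise-linear losses
# f_t(x) = max(<a_t, x>, <b_t, x>), played on the Euclidean unit ball.
rng = np.random.default_rng(2)
T, d = 500, 3
x = np.zeros(d)
player_loss = 0.0

for t in range(1, T + 1):
    a, b = rng.normal(size=d), rng.normal(size=d)
    player_loss += max(a @ x, b @ x)
    g = a if a @ x >= b @ x else b      # a subgradient of the max
    x = x - g / np.sqrt(t)              # O(1/sqrt(t)) step size
    n = np.linalg.norm(x)
    if n > 1.0:                         # project back onto the unit ball
        x = x / n

# The origin is feasible and incurs zero loss every round, so player_loss
# upper-bounds the regret against that fixed comparator.
```

On a polyhedral decision set this square-root growth is unavoidable per the thesis result, whereas strong curvature of the set is what opens the door to logarithmic regret.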
Real-Time Bidding with Side Information
© 2017 Neural Information Processing Systems Foundation. All rights reserved. We consider the problem of repeated bidding in online advertising auctions when some side information (e.g. browser cookies) is available ahead of submitting a bid in the form of a d-dimensional vector. The goal for the advertiser is to maximize the total utility (e.g. the total number of clicks) derived from displaying ads given that a limited budget B is allocated for a given time horizon T. Optimizing the bids is modeled as a contextual Multi-Armed Bandit (MAB) problem with a knapsack constraint and a continuum of arms. We develop UCB-type algorithms that combine two streams of literature: the confidence-set approach to linear contextual MABs and the probabilistic bisection search method for stochastic root-finding. Under mild assumptions on the underlying unknown distribution, we establish distribution-independent regret bounds of order Õ(d · √T) when either B = ∞ or when B scales linearly with T.
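A stripped-down, non-contextual version of the budgeted bandit problem can be sketched as follows: each pull consumes one unit of a budget B and yields a Bernoulli reward (a click). The arm set, click rates, and unit cost per pull are illustrative assumptions, and this plain UCB loop does not implement the paper's confidence-set or bisection-search machinery.

```python
import numpy as np

rng = np.random.default_rng(3)
true_ctr = np.array([0.2, 0.5, 0.35])   # unknown click rates (assumption)
B = 300                                  # total budget = number of pulls allowed
K = len(true_ctr)

counts = np.zeros(K)                     # pulls per arm
sums = np.zeros(K)                       # total reward per arm
total_reward = 0.0

for t in range(B):
    if t < K:
        arm = t                          # pull each arm once to initialize
    else:
        # UCB index: empirical mean plus an optimism bonus
        ucb = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
        arm = int(np.argmax(ucb))
    reward = float(rng.random() < true_ctr[arm])
    counts[arm] += 1
    sums[arm] += reward
    total_reward += reward
```

The contextual continuum-of-arms setting the paper studies replaces the discrete index above with a confidence set over a linear model of the context, but the explore-exploit-under-budget tension is the same.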
Robust Adaptive Routing Under Uncertainty
© 2017 INFORMS. We consider the problem of finding an optimal history-dependent routing strategy on a directed graph weighted by stochastic arc costs when the objective is to minimize the risk of spending more than a prescribed budget. To help mitigate the impact of the lack of information on the arc cost probability distributions, we introduce a robust counterpart where the distributions are only known through confidence intervals on some statistics such as the mean, the mean absolute deviation, and any quantile. Leveraging recent results in distributionally robust optimization, we develop a general-purpose algorithm to compute an approximate optimal strategy. To illustrate the benefits of the robust approach, we run numerical experiments with field data from the Singapore road network.
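A crude way to see the role of interval uncertainty is to run a standard shortest-path search on the worst-case (upper-bound) arc costs. The toy graph below is an assumption, and this static rule does not capture the history-dependent, distributionally robust strategy the paper develops; it only bounds the pessimistic travel time.

```python
import heapq

graph = {                    # node -> list of (neighbor, (low, high) cost interval)
    "A": [("B", (2, 4)), ("C", (1, 6))],
    "B": [("D", (2, 3))],
    "C": [("D", (1, 2))],
    "D": [],
}

def robust_shortest_path(graph, src, dst):
    """Dijkstra on the interval upper bounds: a worst-case baseline."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, (_, high) in graph[u]:
            nd = d + high               # pessimistic (upper-bound) arc cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return float("inf")

worst_case = robust_shortest_path(graph, "A", "D")  # A -> B -> D: 4 + 3 = 7
```

A history-dependent strategy can do strictly better than any such fixed path because it adapts the remaining route to the travel times realized so far.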
Online learning with a hint
© 2017 Neural Information Processing Systems Foundation. All rights reserved. We study a variant of online linear optimization where the player receives a hint about the loss function at the beginning of each round. The hint is given in the form of a vector that is weakly correlated with the loss vector on that round. We show that the player can benefit from such a hint if the set of feasible actions is sufficiently round. Specifically, if the set is strongly convex, the hint can be used to guarantee a regret of O(log(T)), and if the set is q-uniformly convex for q ∈ (2, 3), the hint can be used to guarantee a regret of o(√T). In contrast, we establish Ω(√T) lower bounds on regret when the set of feasible actions is a polyhedron.
Office of Naval Research (Grant N00014-15-1-2083)
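The benefit of a correlated hint on a strongly convex set such as the Euclidean unit ball can be illustrated with a toy simulation: each round the player sees a unit hint h_t positively correlated (on average) with the unseen loss vector c_t and simply plays against it. The correlation model and this "play against the hint" rule are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(4)
T, d, alpha = 200, 5, 0.3    # horizon, dimension, hint correlation (assumptions)
cum_loss = 0.0
cum_c = np.zeros(d)

for _ in range(T):
    c = rng.normal(size=d)                       # loss vector, revealed after play
    noise = rng.normal(size=d)
    # hint: unit-scale mix of the true loss direction and independent noise
    h = alpha * c / np.linalg.norm(c) \
        + np.sqrt(1 - alpha ** 2) * noise / np.linalg.norm(noise)
    x = -h / np.linalg.norm(h)                   # play against the hint
    cum_loss += float(c @ x)
    cum_c += c

# Best fixed action in hindsight on the unit ball is -cum_c / ||cum_c||.
best_fixed = -float(np.linalg.norm(cum_c))
regret = cum_loss - best_fixed
```

On a strongly convex set the hint can be exploited every round in this way, which is the geometric mechanism behind the O(log(T)) guarantee; on a polyhedron the hint gives no such per-round leverage, matching the Ω(√T) lower bound.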